A Post by Michael B. Spring
The Federation and Balkanization of Information (June 18, 2009)
There is little doubt that we
are learning new ways to use information to transform economic and social
enterprises. In e-business, one of the key concepts I teach is the notion of
replacing inventory with information. The lecture is long and detailed, but
let it suffice here to say that it is pretty easy to see that inventory
represents an investment of money and that it costs money (storage space,
pilferage, etc.) to store it. If we have perfect information about our needs,
we can manage inventory on demand. Thus, we replace expensive inventory with
cheap information. In similar ways, the use of the internet is having a dramatic
impact on our social system, from politics to health care. In the midst of all
this, we all have a clear sense of what information is, but good scientific
definitions elude us. I believe it is important that we have a better sense of
how to measure information, how to determine its worth, and how it flows, is
transformed, aggregated, and balkanized. Information
balkanization is only one manifestation of the effort to control and manage
information. Balkanization has, to some extent, served to protect
information about me, but there are signs of information federation
that allow partners to share information. I fear that costly balkanization
will soon give way to revenue-generating federation.
Definitions of Information
There are a variety of ways to establish
information metrics. It is theoretically appealing to fall back on Shannon’s definition of information as a measure of the entropy in a signal.
Unfortunately, this definition does little to inform economic or social
policy. While information can be measured objectively in terms of entropy, it
is the impact it has on people and systems that may be more critical. The
common sense social definition – i.e., information is that which I don’t know
already – is interesting in that it makes every measure of information relative.
(I suspect you already knew that.) We might try to be more formal and say that
data that causes a change in the state of the receiving system is information.
We might try to build a more complete model and suggest that information
encoded into a system constitutes knowledge. This approach is appealing in
that it links the definitions of data, information, and knowledge. It is
unappealing in that it lacks a quantitative metric, and even if one existed, it
would necessarily depend heavily on the individual receiving the data. (What
is information to you may not be information to me. At the same time, we might
both share the same knowledge.)
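Shannon's measure can be made concrete with a short sketch. The function below is an illustration only: it estimates entropy from the symbol frequencies of a single message, which is an assumption of this example, and the name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits.

    Assumption: symbol probabilities are estimated from the
    frequencies observed in the message itself.
    """
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

Under these assumptions, "aaaa" carries 0 bits per symbol, "abab" carries 1, and "abcd" carries 2. The number is the same for every reader, which is why it says little about what a message is worth to a particular person or system.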
Public versus private information
One of the interesting
problems we face today is the aggregation of information about us on the web.
Much of this information is balkanized, and some of it is more federated than
we know. What is more interesting is that the information about us includes
both public information that we share (a photograph) and
private information we may volunteer in the anticipation that it will be kept
confidential (a credit card number or our birthdate). But there is also
information about us that can be gathered from our clicks that reveals things
about us that we might not know. This brings to mind the “Johari Window”
proposed by Joseph Luft and Harry Ingham in the 1950s as a conceptual model
used in counseling and self-help groups. They suggest, in a narrow
context, that there are four categories of information about an individual:
Johari Window
                     | Known to self: yes | Known to self: no
Known to others: yes | open               | blind
Known to others: no  | hidden             | unknown
It is surely the case that all
four types of information exist on the web. We have discussed three of them
already. The fourth, unknown, is simply waiting for the right data mining
technique.
Ownership and shared information
This leads one to interesting
notions of the ownership of information and the location of information.
Perhaps the most interesting information in this category is medical
information. Consider an individual’s medical record. Who owns this record?
Is it the physician who created it? Is it the patient it is about? The matter
may be further complicated by considering the components of a medical record.
Who owns the following:
- An x-ray of my lung
- A reading of the x-ray
- A diagnosis based on the reading
I would like to
believe that my medical record, my employee record, and my education record are all
my property, but it is not clear that this is the case. I don’t believe any mail
system states that the author owns the content. I know that at my University it
is the University, not I, that legally owns it. Challenges to the
perception that it is my personal information are few and far between, and they
generally meet socially acceptable criteria for the intrusion, but they
exist.
“Order” of information
Some information is primary – for example,
one might consider the statement that it rained in Pittsburgh primary. This
could be called first order information. Collecting these pieces of
information for a year, one might derive a piece of second order information,
such as the fact that 2003 saw rain on 30% of the days in Pittsburgh. Collecting this
information for several years, one could derive a piece of third order
information – over 20 years, the number of rainy days per year in Pittsburgh is decreasing. Information might also be categorized in terms of how it is
derived, and this might impact other properties – such as ownership. For
example, the fact that a person has a certain height or weight might be defined
as first order information. The representation of that information as a
certain number of inches or a certain number of pounds might be declared second
order information. The fact that a person is “overweight” is obtained by a
function of height and weight in accord with some algorithm. Is the
information that a person is “overweight” their information or does it belong
to the person who applied the algorithm?
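The rainfall example at the start of this section can be sketched in a few lines. This is a minimal illustration, not a formal definition of "order": the observations are invented, and the function names and the choice of a least-squares slope as the third order summary are assumptions of this sketch.

```python
from statistics import mean

# Hypothetical first order records: year -> one True/False per observed day.
# The values are invented for illustration; they are not real Pittsburgh data.
observations = {
    2001: [True] * 4 + [False] * 6,
    2002: [True] * 3 + [False] * 7,
    2003: [True] * 2 + [False] * 8,
}


def rainy_fraction(days):
    """Second order: share of observed days on which it rained."""
    return sum(days) / len(days)


def yearly_trend(fractions_by_year):
    """Third order: least-squares slope of rainy fraction against year.

    Needs at least two distinct years, otherwise the denominator is zero.
    """
    years = sorted(fractions_by_year)
    values = [fractions_by_year[y] for y in years]
    year_mean, value_mean = mean(years), mean(values)
    numerator = sum((y - year_mean) * (v - value_mean)
                    for y, v in zip(years, values))
    denominator = sum((y - year_mean) ** 2 for y in years)
    return numerator / denominator


second_order = {year: rainy_fraction(days) for year, days in observations.items()}
third_order = yearly_trend(second_order)  # negative slope: fewer rainy days over time
```

Each step up discards detail and depends on a function that someone chose to apply, which is where the ownership question becomes pointed: the first order facts describe Pittsburgh (or a person), but the derived figure exists only because of the function.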
High grade versus low grade
Some information that
describes me explicitly – my height, my medical condition – might be considered
high-grade information about me. Low grade information might include:
- Which books I look at on amazon.com
- When I am logged on to the internet
- Which Microsoft applications I use, or what I ask for help about
Do I have the same right of
ownership over high and low grade information? Indeed, how is ownership
established? When low grade information is anonymous and aggregated, who owns
it? Generally, I do not control this kind of information about me. When is
information explicitly or implicitly transferred? As with the shared ownership
issue, this category again asks who actually owns what portion of the
information.
Operations on Information
We lack formalizations for
operations on information. While we know how to copy or delete artifacts that
contain information, it is not clear how we determine whether two pieces of
information are identical. How do we “subtract” or “add” two pieces of
information? How do we measure the change in information after a
transformation? How do we transform information from first to second order?
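The gap between comparing artifacts and comparing information can be illustrated with a small sketch. It assumes, purely for the example, that the information is encoded as JSON and that re-serializing with sorted keys is an acceptable canonical form; the function names are hypothetical.

```python
import hashlib
import json


def artifact_digest(raw: bytes) -> str:
    """Fingerprint of the artifact: any change in the bytes changes the digest."""
    return hashlib.sha256(raw).hexdigest()


def information_digest(raw: bytes) -> str:
    """Fingerprint of the content under an assumed canonical form.

    Assumption: the artifact is JSON, and key order and whitespace
    carry no information.
    """
    canonical = json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two different artifacts that state the same thing.
first = b'{"city": "Pittsburgh", "rained": true}'
second = b'{"rained":true,   "city":"Pittsburgh"}'

same_artifact = artifact_digest(first) == artifact_digest(second)            # False
same_information = information_digest(first) == information_digest(second)  # True
```

The comparison works here only because a canonical form was agreed on in advance for one encoding. Nothing in the sketch says how to compare a sentence with a table, or how to "subtract" one record from another, so the general questions above remain open.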
From an end-user perspective,
the operational problems with information include such things as understanding
the ownership of information, dissemination of information about individuals,
tracking of information flows, etc. A variety of disciplines may contribute to
more operational and rigorous definitions of information.
We might be concerned about
keeping private information private: how do I secure information about me?
How do I control information I divulge to others? How do I ensure it is not
reused without my permission? Is there a difference between selling my email
address to others, telling others I am looking for a mortgage, or telling
others I work at a University?
What is the role of
public-private key encryption in controlling access to “jointly-owned”
information? For example, a physician could keep information about me, but it could only
be released when accompanied by both my and the physician’s private keys. How
does this impact medical research? How does this impact emergency care? What
happens to my information when I cease to exist? How does my privacy impact
the public right to be safe?
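The "both parties must consent" property can be illustrated without public-key machinery. The sketch below is a deliberately simpler stand-in, not public-private key encryption: it splits a record into two random-looking shares by XOR, so that neither share alone reveals anything and both are needed to release the record. The function names and the sample record are hypothetical.

```python
import secrets
from typing import Tuple


def split_record(record: bytes) -> Tuple[bytes, bytes]:
    """Split a record into a patient share and a physician share.

    The patient share is random; the physician share is the record
    XORed with it. Either share alone is indistinguishable from noise.
    """
    patient_share = secrets.token_bytes(len(record))
    physician_share = bytes(r ^ p for r, p in zip(record, patient_share))
    return patient_share, physician_share


def release_record(patient_share: bytes, physician_share: bytes) -> bytes:
    """Recombine the shares; both are required to recover the record."""
    if len(patient_share) != len(physician_share):
        raise ValueError("shares must have the same length")
    return bytes(a ^ b for a, b in zip(patient_share, physician_share))


record = b"x-ray reading: no abnormality"  # invented sample content
mine, physicians = split_record(record)
released = release_record(mine, physicians)  # equals record
```

Even this toy makes the policy questions concrete: if the patient's share is unavailable, as in an emergency or after the patient's death, the record cannot be released at all unless some escrow arrangement was made beforehand.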
Can information be doped?
Explosives are tagged or doped with trace chemicals that allow the origin of
the explosive to be traced in the event that it is used in a crime. Like some
compression schemes, the doping process is asymmetric. It is low cost to dope an
explosive and high cost to trace the tags. As with compression, the basic idea is
to keep the costs of the most frequent operation low and allow the costs of the
less frequent operation to grow. Thus, when a file is infrequently compressed
and very frequently decompressed, the cost of the compression can be high while
the cost of decompression is kept low. How might doping or watermarking be
used to trace illicit use or dissemination of information?
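One way to picture doping for text is to tag each copy with an invisible mark that identifies its recipient. The sketch below is a toy under stated assumptions: it appends zero-width Unicode characters encoding a numeric tag, and the names dope and trace, the 16-bit tag size, and the sample text are all hypothetical. It is not a robust watermark, since anyone who strips the zero-width characters removes the tag.

```python
from typing import Optional

ZERO = "\u200b"  # zero-width space encodes bit 0
ONE = "\u200c"   # zero-width non-joiner encodes bit 1


def dope(text: str, tag: int, bits: int = 16) -> str:
    """Append an invisible tag (least significant bit first) to the text."""
    if not 0 <= tag < 2 ** bits:
        raise ValueError("tag does not fit in the given number of bits")
    mark = "".join(ONE if (tag >> i) & 1 else ZERO for i in range(bits))
    return text + mark


def trace(text: str, bits: int = 16) -> Optional[int]:
    """Recover the tag from a doped copy, or None if no full tag is present."""
    marks = [ch for ch in text if ch in (ZERO, ONE)]
    if len(marks) < bits:
        return None
    return sum(1 << i for i, ch in enumerate(marks[:bits]) if ch == ONE)


copy_for_partner = dope("My email address is on file.", tag=42)
origin = trace(copy_for_partner)  # 42
```

In this toy both operations are cheap, so it does not reproduce the cost asymmetry described above. A scheme that survives editing, reformatting, or paraphrase would likely make tracing far more expensive than tagging, which is closer to the explosives analogy.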